Meta-analysis of whole-genome gene expression datasets assessing the effects of IDH1 and IDH2 mutations in isogenic disease models

Mutations in isocitrate dehydrogenase 1 (IDH1) and IDH2 are oncogenic drivers to a variable extent in several tumors, including gliomas, acute myeloid leukemia (AML), cholangiocarcinoma, melanoma, and thyroid carcinoma. The pathobiological effects of these mutations vary considerably, impeding the identification of common expression profiles. We performed an expression meta-analysis between IDH-mutant (IDHmut) and IDH-wild-type (IDHwt) conditions in six human and mouse isogenic disease models. The datasets included colon cancer cells, glioma cells, heart tissue, hepatoblasts, and neural stem cells. Among differentially expressed genes (DEGs), serine protease 23 (PRSS23) was upregulated in four datasets, i.e., in human colon carcinoma cells, mouse heart tissue, mouse neural stem cells, and human glioma cells. Carbonic anhydrase 2 (CA2) and prolyl 3-hydroxylase 2 (P3H2) were upregulated in three datasets, and SOX2 overlapping transcript (SOX2-OT) was downregulated in three datasets. The most significantly overrepresented protein class was termed intercellular signal molecules. An additional DEG set contained genes that were both up- and downregulated in different datasets and included oxidases and extracellular matrix structural proteins as the most significantly overrepresented protein classes. In conclusion, this meta-analysis provides a comprehensive overview of the expression effects of IDH mutations shared between different isogenic disease models. The generated dataset includes biomarkers, e.g., PRSS23 that may gain relevance for further research or clinical applications in IDHmut tumors.

. Expression studies on isogenic disease models included in the meta-analysis. a For the four mouse isogenic disease models, the percentage of mouse genes without corresponding human ortholog ranged between approximately 8.6% (mouse hepatoblasts), 12.8% (mouse neural stem cells), 17.1% (mouse heart tissue), and 18.7% (mouse glioma cells). b Low 2-HG expressing IDH2 R140Q cells were excluded from analysis and the respective parental cells were used as control. c The batch grown on uncoated plates for hepatocyte differentiation was excluded from analysis.  www.nature.com/scientificreports/ molecules ( Fig. 2a) and included BDNF, BMP4, RCAN1, SEMA7A, STC1, TGFB2, and WNT7A, all of which were comparably upregulated under IDH1/2 mut conditions. The most significantly overrepresented pathways included extracellular matrix organization, collagen biosynthesis and modifying enzymes, and collagen formation (Table 3). The most significantly associated networks were related to various diseases, conditions, and cellular functions ( Table 3). The top three networks were assembled with molecular relationship factors and displayed as a merged network (Fig. 3). Further interpretation of the DEG set was performed with the upstream regulator analysis tool ( Supplementary Fig. 1). Activated upstream regulators that were predicted to be most significantly associated with the DEG set comprised chorionic gonadotropin (CG) complex, cytokine WNT3A, transmembrane receptor IL10RA, and transcription factor TP53. The transporter APOE and cytokine IFNG were predicted to be the most significantly inhibited upstream regulators.
Ontology and pathway analysis of genes both up-and downregulated in the meta-analysis dataset. The most significantly overrepresented GO annotations in the DEG set included diverse morphogenic and developmental processes, extracellular matrix components, and various binding properties in the categories of biological process, cellular component, and molecular function, respectively (Fig. 2b). The most significantly overrepresented protein classes included oxidases (p = 3.36 × 10 -4 ), comprising PRODH, LOX, QSOX1, and QSOX2, and extracellular matrix structural proteins (p = 5.36 × 10 -4 ) comprising COL4A1, COL4A2, COL8A1, and FBN1 (Fig. 2b). The most significantly overrepresented pathways included integrin cell surface interactions, extracellular matrix organization, and post-translational protein phosphorylation (Table. 3). The most significantly associated networks were related to various developmental processes, diseases, conditions, and cellular functions ( Table 3). The top three networks were assembled with molecular relationship factors and displayed as a merged network (Fig. 4).

Discussion
In this meta-analysis, we compared the expression profiles of different IDH mut vs. IDH wt isogenic disease models to provide an overview of the nearly unbiased expression effects and the corresponding biological interpretations caused by the oncometabolite 2-HG. Although the statistical power of the IDH mut vs. IDH wt isogenic cell model datasets is generally lower than that of larger datasets generated in clinical tumor cases, the number of DEGs in proportion to the sample size is seemingly higher in isogenic cell models 21 . One likely explanation for this fact is that individual expression profiles vary considerably within IDH mut tumors, similar to as in other tumors, limiting the capacity to generate common expression profiles. However, in our meta-analysis, only a relatively low number of DEGs were shared between individual datasets, which can be attributed to the fact that different cancer and non-cancer isogenic disease models and experimental conditions were used as briefly outlined as follows: Using colon carcinoma cells, in which IDH1/2 mutations were inserted via a recombinant adeno-associated virus vector methodology, an epithelial-mesenchymal transition (EMT)-like phenotype and Table 2. Meta-analysis DEG sets compiled from individual datasets of isogenic disease models. Upregulated genes   ARPC5, BDH1, BDNF, BMP4, CA2, CACHD1, CACNB4, CCDC80, CHST11, CHST15, CLU, CSF1, CYBRD1, DHX37, DPYSL5, EPAS1,  EPDR1, FAM189A1, FOXF1, FSCN1, GALNS, GIMAP6, GRK5, HMOX1, KBTBD8, KIF3C, LGR4, MAML2, MANEAL, MAP1A, MCAM,  MEST, MYBL1, OTUB2, P3H2, PLD1, PODXL, PRSS23, RCAN1, S100A2, SCARB1, SDC1, SEMA7A, SEPTIN11, SERINC2, SLC38A3,  SPRY1, STC1, STK17A, TBC1D4, TGFB2,  www.nature.com/scientificreports/ changes in gene expression and cell morphology were observed 22 . In transgenic mouse models with conditional IDH2 mut coding sequences, activation of IDH2 mut expression at five weeks of age produced D-2HG leading to cardiomyopathy and neurodegeneration 23 . In hepatoblasts, isolated from mouse embryos at E14, a doxycycline-inducible system led to IDH1/2 mut gene expression 24 . The IDH1/2 mut hepatoblasts, which were cultured on collagen-coated plates, were refractory to differentiation. In neural stem cells derived from the cortex of mouse embryos at E14.5, Idh1 mut expression was induced via adenoviral-Cre-recombinase transduction 25 . In these cells, neuronal lineage differentiation was blocked, although differentiation-promoting culture conditions were utilized. Employing a mouse model that is susceptible to the development of gliomas, p53-deficient cells with vector-integrated IDH1 mut genes and cells containing a PDGF expression vector were coinjected into mice. The induced PDGF-driven gliomas showed reduced immune infiltration in comparison to the corresponding IDH1 wt glioma mouse model 21 . In an in vitro study, glioma cells were infected with lentivirus IDH1 mut coding sequences 26 . Doxycycline-induced IDH1 mut gene expression resulted in enhanced cell motility and morphological changes. The heterogeneity between the six isogenic disease models is exemplarily demonstrated by the diverse classification of the top pathways that were derived from the DEGs of each of the disease models ( Supplementary Fig. 2). The serine protease PRSS23 exhibits low tissue specificity in humans with the highest expression levels in female genital tract tissue and smooth muscle 27 . Studies in mice reported that PRSS23 is variably expressed in the preimplantation uterus and is possibly involved in tissue remodeling in the ovary 28,29 . The expression of PRSS23 has been detected in nuclei and extracellular vesicular exosomes where the protease is a component of the human secretome 30 . Exosomal PRSS23 is, e.g., involved in cardiovascular disease where the protease likely www.nature.com/scientificreports/ mediates Snail/alpha-smooth muscle actin signalling 31 . In cancer, PRSS23 is implicated in tumor progression, and it was identified in a systematic network survey of a meta-analysis of breast cancer microarray expression data as one of six genes involved in acquired lapatinib resistance 32 . Promoter studies in breast cancer cells indicated that PRSS23 is upregulated by estrogen receptor 1 (ESR1) and that its upregulated expression contributes to cell proliferation 33 . shRNA-mediated knockdown of PRSS23 in a gastric cancer xenograft mouse model resulted in a decrease in tumor volume and tumor weight 34 . Further in vitro experiments revealed that PRSS23 knockdown in gastric cancer cells apparently affected EIF2 pathway molecules. Based on a microarray study, PRSS23 was included in a gene classifier set that could discriminate papillary thyroid carcinoma from normal thyroid samples 35 . In head and neck, renal, and pancreatic cancer, PRSS23 expression is significantly associated with an unfavorable prognosis 30 . An epigenome-wide association study found, among several other DNA methylation sites, a significant association between changes of DNA methylation of DNA methylation sites at the PRSS23 gene and having a smoking habit but found no significant association with risk for lung cancer 36 . The BioGRID database currently curates about 50 PRSS23 interactors, among which actin and actin-related proteins constitute the most overrepresented PANTHER protein class (p-value = 3 × 10 -3 ) (Supplementary Fig. 3). Cytosolic CA2 is the physiologically predominant CA isoform and is known to interact with various acid/ base transporters 37 . These interactions are predicted to promote high glycolytic activity and cell proliferation in tumors. In lung cancer xenograft mouse models, shRNA-mediated knockdown of CA2 impaired tumor cell proliferation and angiogenesis and induced apoptosis 38 . Pharmacological studies exploring CA2 inhibitors are pursued to develop therapeutic options for the treatment of various conditions including cancer 39 . P3H family members consist of three isoenzymes in vertebrates. From a knockout study on P3H2 in a mouse embryonal carcinoma cell line, it can be presumed that the enzyme is the major posttranslational modifier of type IV collagen with 3-hydroxyproline, which is of significance for interactions of type IV collagen with other molecules 40 . High P3H2 expression in different parts of the CNS, gastrointestinal tract, and some other tissues has been reported; however, the enzyme exhibits no prognostic significance in cancer and reveals only weak-to-moderate staining in most cancer tissues 30 . The long non-coding RNA (lncRNA) SOX2-OT consists of several splice variants. SOX2, located in an intron of SOX2-OT, is transcribed in the same orientation as SOX2-OT and both are intensely expressed in embryonic stem cells 41 . SOX2-OT is implicated in neuronal and tumor development and progression. A meta-analysis of cancer datasets indicated that cancers with elevated SOX2-OT expression are significantly associated with unfavorable prognostic factors 42 . In two cervical cancer cell lines, a SOX2-OT transcript variant promoted cell growth, migration and invasion of the cells, indicating that the lncRNA may constitute a practical biomarker for cervical cancer 43 . However, lower expression of SOX2-OT was observed in Table 3. Top pathways and networks compiled from the meta-analysis DEG sets. www.nature.com/scientificreports/ Figure 3. The merged network is compiled from the top three networks that were most significantly associated with the DEGs, which were either up-or downregulated in at least two individual datasets ( www.nature.com/scientificreports/ gastric tumors compared to matched normal gastric samples, and lower expression was observed in high-grade rather than low-grade gastric tumors 44 . Furthermore, we assessed the similarity of expression profiles between the either up-or downregulated gene set from our meta-analysis with expression profiles of two publicly accessible datasets of low grade gliomas and chondrosarcomas, enabling us to compare IDH mut with IDH wt cancer samples [45][46][47] . The Venn diagram demonstrates that only a few DEGs are shared between our meta-analysis dataset and both clinical datasets (Supplementary Fig. 4). One likely explanation for this fact is that primary expression effects of an IDH mutation that emerge over days or weeks are measured in isogenic disease models, whereas clinical IDH mut tumors evolve over months or years and acquire multiple other genomic alterations before they become clinically evident.

Methods
Compilation of datasets from IDH1/2 mut vs. IDH1/2 wt isogenic disease models. Using the search term IDH to query the Gene Expression Omnibus (GEO), we designated 114 case series, out of which we identified seven whole-genome gene expression datasets derived from human and mouse isogenic disease models that compared IDH1/2 mut with IDH1/2 wt samples 48 . One dataset without publication reference with detailed information was deselected. We then selected the remaining six studies for further analysis. These studies contained at least biologically IDH1/2 mut triplicates and biologically IDH1/2 wt duplicates and the datasets of each of the studies were sufficiently significant to compile a DEG set based on an FDR-adjusted p-value ≤ 0.05 and an FC ≥ 1.5. In studies that employed an isogenic disease model with different IDH1/2 mutations, the raw datasets of the different IDH1/2 mutations were pooled and processed as a single IDH1/2 mutation dataset. The generated meta-analysis dataset includes GEO submissions GSE41802 22 , GSE54838 23 , GSE57002 24 , GSE88828 25 , GSE96979 21 , and GSE147223 26 . Using the same above-quoted search strategy, no additional datasets were identified in another publicly accessible repository for high-throughput functional genomics experiments 49 . The database repositories were essentially interrogated in November 2020.
Generation of DEG sets. For four microarray GEO datasets, the binary CEL files comprising the intensity calculations were imported into Transcriptome Analysis Console (TAC) version 4.0.2.15 (Thermo Fisher Scientific, Waltham, MA). TAC includes the LIMMA (linear models for microarray data) statistical package from Bioconductor 50 . The binary CEL files were normalized in TAC and files of differentially expressed probe sets were compiled using eBayes correction in ANOVA. For the study utilizing the expression BeadChips, the normalized dataset was analyzed using the NetworkAnalyst 3.0 platform, which employs LIMMA statistics to generate differentially expressed probe sets 51 . For genes with more than one probe set in a dataset, the probe set with the highest FC was selected for further analysis; however, genes, with both significantly up-and downregulated probe sets in the same dataset, were excluded from further analysis. For the RNA-seq dataset, the publicly accessible Sequence Read Archive (SRA) datasets were downloaded from the NCBI resource 52 . We aligned the RNA-seq reads to the human reference genome assembly GRCh37 (hg19), using STAR aligner 53 . Then, the R package DESeq2 was used to normalize count data, remove outliers, determine filtering thresholds, and find genes that were significantly differentially expressed between the IDH1 mut and IDH1 wt groups 54  Ontology and pathway analysis. For further analysis of DEGs, which were based on an FDR-adjusted p-value ≤ 0.05 and an FC ≥ 1.5, the statistical overrepresentation test of the GO program PANTHER v. 16.0 was employed to interrogate annotation datasets in the categories of biological process, cellular component, molecular function, protein classes, and Reactome pathways 59 . The PANTHER protein class ontology comprises commonly used classes of protein functions. The Reactome pathway analysis specifies the biological relationships between interacting molecules such as nucleic acids, proteins, and compounds. For all annotation datasets, a Fisher's exact test p-value < 0.05 indicated statistical significance. The BioGRID build 4.1 database was queried for protein interactors 60 . BioGRID curates protein, genetic and chemical interactions from various biomedical studies and datasets. The Ingenuity Pathway Analysis (IPA) software v. 68,752,261 (Qiagen, Hilden, Germany) was employed for further multifactorial interpretation of the gene sets. IPA utilizes the curated Ingenuity knowledge base as a reference dataset to interfere molecular relationships. Fisher's exact test p-values indicated the significance of associations between analyzed dataset molecules and functional frameworks prebuilt or generated de novo by IPA. The molecule activity predictor was applied to predict expression effects/coherence of the expression effects of a molecule on other network molecules. Direct molecular relationships were used to survey the significance of fit, indicated as a score value, between molecules of uploaded gene sets and networks associated with specific functions or diseases. Direct and indirect molecular relationships were used for upstream regulator network analysis to investigate how upstream regulators affect differences in target gene expression. A z-score value indicates the activation/inhibition state of an upstream regulator.

Data availability
The raw datasets analyzed in the study are available at the GEO repository. www.nature.com/scientificreports/